feat: Add Amazon S3 Vectors document store integration #3149

dotKokott wants to merge 9 commits into deepset-ai:main from
Conversation
Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with:

- S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete)
- S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering
- Filter conversion from Haystack format to S3 Vectors filter syntax
- Auto-creation of vector buckets and indexes
- AWS credential support via Secret (or default credential chain)
- 49 unit tests covering store, retriever, filters, and serialization
- README with usage examples and known limitations
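The filter conversion mentioned above can be sketched roughly as follows. This is an illustrative reimplementation, not the PR's actual code: the helper name `convert_filters`, the operator map, and the assumption that S3 Vectors accepts MongoDB-style operators (`$eq`, `$and`, ...) are all mine.

```python
# Illustrative sketch of converting Haystack comparison/logic filters to an
# S3 Vectors-style metadata filter. Operator names ($eq, $gt, $and, ...)
# are an assumption about the target syntax, not verified API details.

_OPS = {"==": "$eq", "!=": "$ne", ">": "$gt", ">=": "$gte",
        "<": "$lt", "<=": "$lte", "in": "$in"}

def convert_filters(haystack_filter: dict) -> dict:
    """Recursively map a Haystack filter dict to the assumed target syntax."""
    if "conditions" in haystack_filter:  # logical node: AND / OR
        op = "$and" if haystack_filter["operator"] == "AND" else "$or"
        return {op: [convert_filters(c) for c in haystack_filter["conditions"]]}
    # comparison node: strip the "meta." prefix Haystack uses for metadata fields
    field = haystack_filter["field"].removeprefix("meta.")
    return {field: {_OPS[haystack_filter["operator"]]: haystack_filter["value"]}}

print(convert_filters({
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.year", "operator": ">=", "value": 2020},
    ],
}))
```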
…rkflow

- boto3 lower bound set to 1.42.0 (when the s3vectors service was added)
- pydoc filename changed to amazon_s3_vectors.md (underscores, matching the folder name)
- Quote $GITHUB_OUTPUT in the workflow to fix shellcheck SC2086
- Flatten test classes into standalone functions (matching the pinecone/qdrant pattern)
- Assert the full serialized dict structure in to_dict/from_dict tests
- Use Mock(spec=...) for retriever tests instead of MagicMock+patch
- Verify _embedding_retrieval call args match exactly
- Add test_from_dict_no_filter_policy (backward compat)
- Add test_init_is_lazy
Remove tests that just verify mock plumbing (count, write, delete calling the mock client). Keep tests that verify our actual logic:

- Serialization roundtrip (full dict structure)
- Score conversion (cosine + euclidean)
- Filter conversion (pure function with real logic)
- Duplicate policy batch checks (SKIP/NONE)
- Document <-> S3 vector conversion
- Input validation

Before: 49 unit tests (many testing mock behavior). After: 26 unit tests (all testing our code) + 12 integration tests.
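The score conversion listed above can be sketched as below. The cosine mapping (`1 - distance`) comes from the PR description; the euclidean mapping shown is one common squashing choice and only an assumption about what the store actually does.

```python
# Sketch of distance-to-score conversion for Haystack's "higher is better"
# convention. Cosine mapping (1 - distance) is stated in the PR; the
# euclidean mapping here is an assumed example, not the verified formula.

def distance_to_score(distance: float, metric: str) -> float:
    if metric == "cosine":
        # cosine distance in [0, 2]; smaller distance -> higher score
        return 1.0 - distance
    if metric == "euclidean":
        # squash unbounded euclidean distance into (0, 1]
        return 1.0 / (1.0 + distance)
    raise ValueError(f"Unsupported metric: {metric}")

print(distance_to_score(0.25, "cosine"))    # 0.75
print(distance_to_score(0.0, "euclidean"))  # 1.0
```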
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only - write_documents: embedding required, 40KB metadata limit - _embedding_retrieval: top_k=100 cap, no embeddings in response - Retriever run: top_k=100, server-side filters, no embeddings returned
…ity, deduplicate retrieval logic

- Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare with haystack.utils.filters.document_matches_filter (the same utility used by InMemoryDocumentStore). Gains the NOT operator, nested dotted field paths, and date comparison support for free. (-65 lines)
- Deduplicate blob/content reconstruction in _embedding_retrieval() by reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines)
- Make the filter_documents() warning conditional on filters actually being provided (no warning when listing all documents)
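For illustration, the gist of the client-side matching now delegated to `haystack.utils.filters.document_matches_filter` looks roughly like this, recreated on plain dicts rather than `Document` objects; the `NOT` semantics shown ("none of the conditions match") are an assumption, not a statement about the real utility.

```python
# Rough recreation of client-side filter matching over plain metadata dicts.
# The real implementation lives in haystack.utils.filters and operates on
# Document objects; this sketch only conveys the recursive structure.

import operator

_CMP = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
        ">=": operator.ge, "<": operator.lt, "<=": operator.le,
        "in": lambda a, b: a in b}

def matches(filters: dict, meta: dict) -> bool:
    if "conditions" in filters:  # logical node
        results = [matches(c, meta) for c in filters["conditions"]]
        if filters["operator"] == "AND":
            return all(results)
        if filters["operator"] == "OR":
            return any(results)
        return not any(results)  # NOT, assumed "none match" semantics
    field = filters["field"].removeprefix("meta.")
    return _CMP[filters["operator"]](meta.get(field), filters["value"])

docs = [{"type": "article", "year": 2021}, {"type": "blog", "year": 2019}]
flt = {"operator": "AND", "conditions": [
    {"field": "meta.type", "operator": "==", "value": "article"},
    {"field": "meta.year", "operator": ">=", "value": 2020}]}
print([d for d in docs if matches(flt, d)])
```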
CI: Integration tests need AWS credential setup

The integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a
What needs to happen

The workflow should match the

```yaml
# Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
  id: aws-auth
  if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
  uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
  with:
    aws-region: us-east-1
    role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}

- name: Run integration tests
  if: success() && steps.aws-auth.outcome == 'success'
  run: hatch run test:integration-cov-append-retry
```

Prerequisites (maintainer action required)
@dotKokott I'll try to take a look in the next few days. Have you tried the integration yourself in a real-world setting with AWS?
I have tried all integration tests and examples on my AWS account. However, I did not try any large datasets. That might be the next thing to validate: does this work as expected under real load?
anakin87
left a comment
I left some initial comments.
Will take a better look soon.
```diff
@@ -0,0 +1,207 @@
# amazon-s3-vectors-haystack
```
We want to have a very minimal README (see https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/amazon_bedrock/README.md). This info is useful, but we'll put it in the docs.
Sounds good. Should I leave it in the README until we know where you will put the info? Or would you like me to minimize the README already?
Yes, let's minimize it, but please save this info somewhere to be re-used in the docs.
Matches the pattern used by the amazon_bedrock workflow:

- top-level id-token: write permission
- AWS_REGION env var
- configure-aws-credentials step (skipped on fork PRs and dependabot)
- integration tests gated on successful auth
Matches the repo convention used across other integrations.
```python
for doc in batch:
    if doc.embedding is None:
        msg = f"Document '{doc.id}' has no embedding. S3VectorsDocumentStore requires embeddings."
        raise DocumentStoreError(msg)
```
I would do this check not per batch but up front for all docs, to raise an error before any writing starts.
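A minimal sketch of that suggestion: validate every document's embedding before the first batch is written, so a failure cannot leave a partially written set. `Document` and `DocumentStoreError` mirror Haystack's names but are stubbed locally here to keep the example self-contained; `validate_embeddings` is a hypothetical helper.

```python
# Up-front validation sketch: collect all offending ids first, then raise
# before any write happens. Stub types stand in for Haystack's own.

from dataclasses import dataclass
from typing import Optional

class DocumentStoreError(Exception):
    pass

@dataclass
class Document:
    id: str
    embedding: Optional[list] = None

def validate_embeddings(documents: list) -> None:
    missing = [doc.id for doc in documents if doc.embedding is None]
    if missing:
        msg = f"Documents {missing} have no embedding. S3VectorsDocumentStore requires embeddings."
        raise DocumentStoreError(msg)

docs = [Document("a", [0.1]), Document("b", None)]
try:
    validate_embeddings(docs)  # raises before any batch is sent
except DocumentStoreError as e:
    print(e)
```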
```python
from haystack_integrations.components.retrievers.amazon_s3_vectors import S3VectorsEmbeddingRetriever
from haystack_integrations.document_stores.amazon_s3_vectors import S3VectorsDocumentStore
```
For integration tests of Document Stores, we should inherit from Haystack's DocumentStoreBaseExtendedTests, which already contains the necessary tests
- Implementation tips
- see pgvector for an example:
Related Issues
Proposed Changes:
Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.
Components:
- `S3VectorsDocumentStore`: full DocumentStore protocol (write, count, filter, delete)
- `S3VectorsEmbeddingRetriever`: embedding-based retrieval with server-side metadata filtering

Key design decisions:
- Distances converted to scores (`1 - distance`) for the Haystack convention
- `filter_documents()` uses `list_vectors(returnData=True, returnMetadata=True)` with client-side filtering (warning logged), since S3 Vectors has no standalone filter API
- `DuplicatePolicy.SKIP`/`NONE` (batches of 100)

Known limitations (documented in README):
- `top_k` capped at 100 (service limit)
- `query_vectors` does not return embedding data
- `float32` only, `cosine`/`euclidean` metrics, eventual consistency

How did you test it?
- `pytest` mark / credential guard for CI
- `hatch run test:all`, `hatch run fmt`, `hatch run test:types`
- Example (`examples/example.py`) verified against live AWS

Notes for the reviewer
This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.
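The duplicate-policy batching described in the design decisions can be sketched as below. The batch size of 100 and the `SKIP`/`NONE` policy names come from the PR; the check itself is an illustrative reimplementation (raising on `NONE` is my assumption about the store's behavior), not the actual store code.

```python
# Sketch of writing ids in batches of 100 with duplicate handling:
# SKIP drops ids that already exist; NONE (assumed here) raises on any
# duplicate. Returns how many ids were actually written.

BATCH_SIZE = 100

def write_in_batches(ids: list, existing: set, policy: str) -> int:
    written = 0
    for i in range(0, len(ids), BATCH_SIZE):
        batch = ids[i : i + BATCH_SIZE]
        dupes = [d for d in batch if d in existing]
        if policy == "NONE" and dupes:
            raise ValueError(f"Duplicate ids: {dupes}")
        if policy == "SKIP":
            batch = [d for d in batch if d not in existing]
        existing.update(batch)
        written += len(batch)
    return written

print(write_in_batches(["a", "b", "c"], existing={"b"}, policy="SKIP"))  # 2
```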
Checklist

- The PR title follows the convention: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.